S3Aug: Segmentation, Sampling, and Shift for Action Recognition
Action recognition is a well-established area of research in computer vision.
In this paper, we propose S3Aug, a video data augmentation method for action
recognition. Unlike conventional video data augmentation methods that cut and
paste regions from two videos, the proposed method generates new videos from a
single training video through segmentation and label-to-image transformation.
Furthermore, the proposed method modifies certain categories of the label
images by sampling to generate a variety of videos, and shifts intermediate
features to enhance the temporal coherency between frames of the generated
videos. Experimental results on the UCF101, HMDB51, and Mimetics datasets
demonstrate the effectiveness of the proposed method, particularly for the
out-of-context videos of the Mimetics dataset.
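The per-frame part of the pipeline described above (segment each frame, resample some label categories, then regenerate the frame by label-to-image transformation) can be sketched roughly as follows. This is a hedged illustration under assumptions, not the authors' code: `segmenter`, `generator`, and `swap_prob` are hypothetical placeholders, and the feature-shift step for temporal coherency is only noted in a comment.

```python
import torch

def s3aug_frame(frame, segmenter, generator, swap_prob=0.5):
    """Augment one frame: segment, resample some categories, regenerate."""
    labels = segmenter(frame)               # (H, W) integer category map
    src = labels.clone()
    num_classes = int(src.max()) + 1
    for c in src.unique():                  # "sampling": remap some categories
        if torch.rand(()) < swap_prob:
            labels[src == c] = int(torch.randint(0, num_classes, (1,)))
    return generator(labels)                # label-to-image transformation

def s3aug(frames, segmenter, generator):
    # In the paper, temporal coherency comes from shifting intermediate
    # features inside the generator; that step is omitted from this sketch.
    return torch.stack([s3aug_frame(f, segmenter, generator) for f in frames])
```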
Joint learning of images and videos with a single Vision Transformer
In this study, we propose a method for jointly learning images and videos
using a single model. In general, images and videos are trained with separate
models. In this paper, we propose a method that takes a batch of images as
input to a single Vision Transformer (IV-ViT), along with a set of video
frames that are temporally aggregated by late fusion. Experimental results on
two image datasets and two action recognition datasets are presented.
Comment: MVA2023 (18th International Conference on Machine Vision
Applications), Hamamatsu, Japan, 23-25 July 2023
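A minimal sketch of the late-fusion idea described in this abstract, assuming a generic per-image backbone; the class name, method, and pooling choice (averaging logits over frames) are my assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class JointImageVideoModel(nn.Module):
    """One backbone for both images and videos; video frames are encoded
    independently and temporally aggregated by late fusion."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone            # any per-image model, e.g. a ViT

    def forward(self, x):
        if x.dim() == 4:                    # images: (B, 3, H, W)
            return self.backbone(x)
        b, t = x.shape[:2]                  # videos: (B, T, 3, H, W)
        logits = self.backbone(x.flatten(0, 1))    # encode all frames at once
        return logits.view(b, t, -1).mean(dim=1)   # late fusion: average over T
```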
Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition
We propose Multi-head Self/Cross-Attention (MSCA), which introduces a
temporal cross-attention mechanism for action recognition, based on the
structure of the Multi-head Self-Attention (MSA) mechanism of the Vision
Transformer (ViT). Simply applying ViT to each frame of a video can capture
frame features, but cannot model temporal information. However, simply
modeling temporal information with a CNN or Transformer is computationally
expensive. TSM, which performs feature shifting, assumes a CNN backbone and
cannot take advantage of the ViT structure. The proposed model captures
temporal information by shifting the Query, Key, and Value in the calculation
of the MSA of ViT. This is efficient, incurs no additional computational cost,
and is a structure well suited to extending ViT along the temporal dimension.
Experiments on Kinetics400 show the effectiveness of the proposed method and
its superiority over previous methods.
Comment: 9 pages
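The core mechanism here, rolling attention inputs across the time axis so that attention becomes cross-attention between neighboring frames, can be sketched as below. This is an assumption-laden illustration, not the authors' implementation: the paper shifts Query, Key, and Value, while this sketch shifts only Key and Value with a TSM-style channel fraction (`shift_div`), which may differ from the exact scheme in the paper.

```python
import torch
import torch.nn.functional as F

def temporal_shift(x, shift_div=8):
    """x: (B, T, N, C) patch tokens per frame. Roll a fraction of the
    channels one step forward/backward in time, zero-padding boundaries."""
    out = x.clone()
    fold = x.size(-1) // shift_div
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                  # t-1 -> t
    out[:, 0, :, :fold] = 0
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]  # t+1 -> t
    out[:, -1, :, fold:2 * fold] = 0
    return out

def shifted_attention(q, k, v, num_heads=8):
    """Attention where K and V are temporally shifted, so frame t's queries
    partly attend to neighboring-frame keys/values (cross-attention)."""
    b, t, n, c = q.shape
    k, v = temporal_shift(k), temporal_shift(v)
    def heads(x):                           # -> (B*T, heads, N, C//heads)
        return x.reshape(b * t, n, num_heads, c // num_heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
    return out.transpose(1, 2).reshape(b, t, n, c)
```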
A Study on Methods for Extracting Object and Person Regions in Images
Nagoya University (名古屋大学), Doctor of Engineering (doctoral thesis)
New Teaching Materials and Instruction Methods for Programming Practice
FY2004 (Heisei 16) Conference on Engineering and Engineering Education
Research, slides; venue: Kanazawa Institute of Technology, Ishikawa
Prefecture; date: July 2004
- …